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Abstract 

We apply the concept of subset seeds proposed in JT] to similarity search in protein sequences. 
The main question studied is the design of efficient seed alphabets to construct seeds with optimal 
sensitivity /selectivity trade-offs. We propose several different design methods and use them to 
construct several alphabets. We then perform a comparative analysis of seeds built over those 
alphabets and compare them with the standard Blastp seeding method [2], Q, as well as with 
the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive 
(but less costly to implement) than the cumulative principle used in Blastp and vector seeds, our 
seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins 
compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking 
of our seeds against several main databases of protein alignments. Here again, the results show a 
comparable or better performance of our seeds vs. BLASTP. 

Index Terms 

protein sequences, protein databases, local alignment, similarity search, seeds, subset seeds, 
multiple seeds, seed alphabet, sensitivity, selectivity 

I. Introduction 

Similarity search in protein sequences is probably the most classical bioinformatics prob- 
lem, and a commonly used algorithmic solution is implemented in the ubiquitous Blast 
software 0, 0. On the other hand, similarity search algorithms for nucleotide sequences 
(DNA, RNA) underwent several years ago a significant improvement due to the idea of spaced 
seeds and its various generalizations 0, (6]|, 0, [H, S, ifTOl . This development, however, 
has little affected protein sequence comparison, although improving the speed/precision trade- 
off for protein search would be of great value for numerous bioinformatics projects. Due to 
a bigger alphabet size, protein seeds are much shorter (typically 2-5 letters instead of 10-20 
letters in the DNA case) and also letter identity is much less relevant in defining hits than 
in the DNA case. For these reasons, the spaced seeds technique might seem not to apply 
directly to protein sequence comparison. 

Recall that Blast applies quite different approaches to protein and DNA sequences to 
define a hit. In the DNA case, a hit is defined as a short pattern of identically matching 
nucleotides whereas in the protein case, a hit is defined through a cumulative contribution of 
a few amino acid matches (not necessarily identities) using a given scoring matrix. Defining 
a hit through an additive contribution of several positions is captured by a general formalism 
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of vector seeds proposed in ifTTTl . On the other hand, it has been understood Q, lfl2ll . lfT3l . 
|[T4l . ifTBI that using simultaneously a family of seeds instead of a single seed can further 
improve the sensitivity/selectivity ratio. Papers [|4]|, lfl6l both propose solutions using a family 
of vector seeds to surpass the performance of BLAST. 

However, using the principle of cumulative score over several adjacent positions has an 
algorithmic cost. Defining a hit through a pattern of exact letter matches allows for a direct 
hashing scheme, where each key of the query sequence is associated with a unique hash 
table entry pointing to the positions of the subject sequence (database) where the key can 
hit. Usually these positions are stored in consecutive memory cells within the hash table. 

On the other hand, defining a hit through a cumulative contribution of several positions 
leads to an additional pre-computed table that stores, for each key, its neighborhood i.e., 
the list of subject keys (or corresponding hash table entries) with which it can form a hit. 
For example, in a standard Blastp setting (Blosum62 scoring matrix with threshold 11 for 
cumulative score of three positions), the expectation, computed according to the Bernoulli 
sequence model, of the number of neighbors of a key is 19.34, i.e. that many accesses to 
the hash table are required for each key. For four positions and threshold 18, as in the case 
of seeds from [0, a key hits expectedly 15.99 keys and this number grows up to 45.59 
when the score threshold decreases to 16. This raises an obvious memory problem: for 
example, for key size 4 and score threshold 18, the total size of neighborhoods is 7609575, 
and for key size 5 the neighborhood table may simply not fit into the memory. Another 
related implementation problem is cache usage: different keys of a neighborhood generally 
correspond to remote segments of the hash table and their processing gives rise to cache 
misses that cause additional latencies. 

Those implementation issues may become a bottleneck in large-scale protein comparisons. 
Furthermore, solving these problems may be very helpful in different specific experimental 
setups, such as in mapping protein comparison algorithms to specialized computer architecture 
(see e.g. [[171 . lfT8T0 where memory usage may be a crucial issue. 

In 01, we proposed a new concept of subset seeds that can be viewed as an intermediate 
between ordinary spaced seeds and vector seeds: subset seeds allow one to distinguish 
between different types of mismatches (or matches) but still treat seed positions independently 
rather than cumulatively. Distinguishing different mismatches is not done by scoring them, 
but by extending the seed alphabet such that each seed letter specifies different sets of 
mismatches. For example, in the DNA case it is beneficial to distinguish between transition 
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mutations (A <-> G, C ^ T) and others (transversions) lPT9ll , GUI . This leads (at least in the 
case of transitive seed alphabets defined in this paper) to the possibility of using the direct 
hashing. 

Since the protein alphabet is much larger than the one of DNA, subset seeds provide a 
very attractive seeding option for protein alignment. In this paper, we study the performance 
of subset seeds applied to protein sequences and compare it to existing seeding techniques 
of Blastp and vector seeds. 

Note again that subset seeds are less expressive than Blast seeds or vector seeds in 
general, but in return, admit a more efficient implementation. Besides treating positions 
independently, subset seeds replace amino acid substitution scores by simply distinguishing 
different classes of mismatches. Therefore, another way to state the motivation of this work 
is to ask whether scores are really necessary at the seeding stage of protein alignment. We 
will show that with a reasonable level of precision the answer to this question is negative. 

In the paradigm of subset seeds, each seed letter specifies a set of amino acid pairs matched 
by this letter. Therefore, a crucial question is the design of an appropriate seed alphabet, 
which is one of the main problems we study in this paper. In fine, the quality of an alphabet 
is determined by the quality of the best seeds that can be constructed over this alphabet. 
The latter is already a complex optimization problem that is usually solved in practice by 
heuristic methods. (For a formal analysis of seed design problem we refer to the recent paper 
11211 and references therein.) The problem of alphabet design studied in this paper presents 
an additional complexity as it introduces an additional dimension of the search space (set of 
possible alphabets), and additionally requires a study of selectivity/sensitivity dependencies 
rather than simply maximizing the sensitivity for a class of seeds with a given selectivity. 
In this paper we propose several heuristic methods that lead to the design of efficient seed 
alphabets and corresponding seeds. 

The paper is organized as follows. In Section UH we introduce some probabilistic notions we 
need to reason about seed efficiency. Section UIT1 introduces the first simple approach to design 
a seed alphabet, which, however, does not lead to so-called transitive seeds, useful in practice. 
Section |IV] presents three different approaches to designing transitive seed alphabets, based 
on a pre-defined (Section IIV-AI) or newly designed (Section IIV-BI) hierarchical clustering of 
amino acids, as well as on a non-hierarchical clustering (Section [IV-CI) . Section IVl describes 
comparative experiments made with the designed seeds, obtained both on probabilistic models 
and on different protein data banks. 
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II. Preliminaries 

Throughout the paper, we denote S = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} = 
{ai}i=i..2o the alphabet of amino acids. 

In most general terms, a (subset) seed letter a is defined as any symmetric and reflexive 
binary relation on S. Let B be a seed alphabet, i.e. a collection of subset seed letters. Then 
a subset seed tc = a% . . . is a word over B, where k is called the span of it. n defines 
a symmetric and reflexive binary relation on words of S fc (called keys): for S\,S2 E 
si ~,r S2 iff Vi G [l..fc], we have (si[i), sz[i]) E an. In this case, we say that seed a hits the 
pair si, s 2 . 

For practical reasons, we would like seed letters to define a transitive relation, in addition. 
This induces an equivalence relation on keys, which is very convenient and allows for an 
efficient indexing scheme (see Introduction). In this paper, we will be mainly interested in 
transitive seed letters, but we will also study the non-transitive case in order to see how 
restrictive the transitivity condition is. 

The quality of a seed letter or of a seed is characterized by two main parameters: sensitivity 
and selectivity. They are defined through background and foreground probabilistic models of 
protein alignments. Foreground probabilities are assumed to represent the distribution of 
amino acids matches in proteins of interest, when two homologous proteins are aligned 
together. Background probabilities, on the other hand, represent the distribution of amino 
acid matches in random alignments, when two proteins are randomly aligned together. 

In this paper, we restrict ourselves to Bernoulli models of proteins and protein alignments, 
although some of the results we will present can be extended to Markov models. 

Assume that we are given background probabilities . . . , 6 20 } of amino acids in protein 
sequences under interest. The background probability of a seed letter a is defined by b(a) = 
S(oj a )ea bibj- The selectivity of a is 1 — b(a) and the weight of a is defined by 

log 6(a) 

w a = i 7717V, (!) 

log&(#) 

where # = {{a, a)\a E S} is the "identity" seed letter. For a seed n = ai.-.a^, the 
background probability of it is b(n) = nLiK ^)' me selectivity of n is 1 — 6(7r) and the 
weight of Ti is w(n) = log 6 ( # ) b(n) = Yli=i w ( a i)- Note that the weight here generalizes the 
weight of classical spaced seeds [|22i defined as the number of "identity" letters it contains. 

Let fij be the probability to see the pair (a i; aj) aligned in a target alignment. The 
foreground probability of a seed letter a is defined by f(a) = Y2( a a )ea Ar ^ ne sens itivity 
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of a seed n is defined as the probability to hit a random target alignment Assume that target 
alignments are specified by a length N. Then the sensitivity of a seed n = ol\ . . . a.^ is the 
probability that a randomly drawn gapless alignment (i.e. string of pairs (aj, af)) of length 
iV contains a fragment of length k which is matched by it. In [JTJ we proposed a general 
algorithm to efficiently compute the seed sensitivity for a broad class of target alignment 
models. This algorithm will be used in the experimental part of this work. 

The general problem of seed design is to obtain seeds with good sensitivity/selectivity 
trade-off. Even within a fixed seed formalism, the quality of a seed is dependent on the 
chosen selectivity value. This is why we will always be interested in computing efficient 
seeds for a large range of selectivity levels. 

III. Dominating seed letters 

Our main question is how to choose seed letters that form good seeds? Intuitively, "good 
letters" are those that best distinguish foreground and background letter alignments. 

For each letter a, consider its foreground and background probabilities f(a) and b(a) 
respectively. Intuitively, we would like to have letters a with large f(a) and small b(a). A 
letter a is said to dominate a letter (3 if f(a) > /(/3) and b(a) < b{(3). Observe that in this 
case, (3 can be removed from consideration, as it can always be advantageously replaced by 
a. 

Consider all amino acid pairs (a*, a,) ordered by descending likelihood ratio fij/bibj. 
Consider the set of pairs R(t) = {(aj, aj) \ fij/bibj > t}. Then the following statement 
holdfl 

Proposition 1: R(t) cannot be dominated by any other letter. 

Proof: Assume by contradiction that R(t) is dominated by some letter a, i.e. f(a) > 
f(R(t)) and b(a) < b(R(t)). Consider (3 = R(t) \ a and 7 = a \ R(t). Clearly, f(j3) < 
f(j) and b((3) > 6(7). On the other hand, V(aj, aj) £ (3, fij/bibj > t and V(aj,aj) £ 7, 
fij/hbj < t. This implies that f(f3) = E( ai , % )e/3 fa > *E(a„ flj )ep b * b j = th (P) and similarly 

'Note that our definitions of sensitivity and selectivity are not symmetric: sensitivity is defined with respect to the entire 
alignment and selectivity with respect to a single alignment position. These definitions capture better the intended parameters 
we want to measure. However, selectivity could also be defined with respect to the entire alignment. We could suggest the 
term specificity for this latter definition. 

2 It is interesting to point out the relationship to the Neyman-Pearson lemma which is a more general formulation of this 
statement. 
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Fig. 1 



Amino acid pairs forming letter R(l) of alphabet non-transitive 



f(i) < tb(j). We then have f((3) > tb(/3) > tb(j) > f(j) which contradicts f((3) < f(j). 

■ 

Proposition Q] suggests that letters R(t) are good candidates to be included to the seed 
alphabet. 

Resulting alphabet. We computed the likelihood ratio for all amino acid pairs, based on 
practical values of background and foreground probabilities computed in accordance with the 
BLOSUM62 matrix (see Section IV-AI) . Not surprisingly, amino acid identities (pairs (a, a)) 
have highest likelihood scores varying from 38.11 for tryptophan (W) down to 3.69 for valine 
(V). 

Among non-identical pairs, only 25 have a score greater than 1 (Figured]). A quick analysis 
shows that those do not form a transitive relation, and therefore -R(l) does not verify the 
transitivity requirement. This is also the case for other threshold values. 

We analyzed a family of threshold letters R(t) for t ranging from to 3 with step 0.05. 
At the extremities of this interval, i?(0) is the "joker" letter admitting all amino acid pairs, 
and -R(3) is the letter corresponding to the exact match relation. Among all those letters, 
there are only 34 different ones. This alphabet of 34 letters (data not shown), denoted 
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Non-transitive, will be used in the experimental part of the paper (Section IVT) in order 
to study how restrictive the requirement of transitive letters is, i.e. how much better are 
general seeds than those obtained with the restriction of transitivity. 

IV. Transitive seed alphabets 

In the case of transitive seed alphabets, every letter a G B is a partition of the amino acid 
alphabet E. In other words, the binary relation associated with each letter (cf Section UTJ) is 
an equivalence relation. Transitive alphabets represent the practical case when each amino 
acid is uniquely mapped to its equivalence class. This, in turn, allows for an efficient hashing 
scheme during the stage of seed search, when different entries of the hash table index non- 
intersecting subsets of keys. 

In Sections IIV-AI and IIV-BI below, we explore transitive seed alphabets satisfying an 
additional "hierarchy condition": for any two seed letters a\,a 2 G & corresponding to 
partitions P ai , P a2 respectively, one of P ai , P a2 is a refinement of the other. Formally, 

for any a±, a 2 G £>, either a\ -< a 2 , or a 2 -< a±, (2) 

where a -< (3 means that every set of Pg is a subset of some set of P a . 

The purpose of the above requirement is to define seed letters using a biologically signifi- 
cant hierarchical clustering of amino acids represented by a tree. In Section ITV-Al we will use 
a pre-defined hierarchical clustering to design efficient seed alphabets. Then in Section ITV-B[ 
we construct our own clustering based on appropriate background and foreground models 
of amino acids distribution. Finally, in Section IIV-CI we lift condition © and study "non- 
hierarchical" seed alphabets. 

A. Transitive alphabets based on a pre-defined clustering 

Assume we have a biologically significant hierarchical clustering tree which is a rooted 
binary tree T with 20 leaves labeled by amino acids. Such trees have been proposed in 
11231 . [|24|. based on different similarity relations. The hierarchical tree derived from 11231 is 
shown in Figure |2] The tree, obtained with a purely bioinformatics analysis, groups together 
amino acids with similar biochemical properties, such as hydrophobic amino acids L, M, I , V, 
hydrophobic aromatic amino acids F,Y,W, amino acids with an alcohol group S,T, or 
charged/polar amino acids E, D, N, Q. A similar grouping has been obtained in [|24|. 

A seed letter is defined here as a subset a of nodes of T such that 
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Fig. 2 

Hierarchical tree derived from ll23ll 



(i) a contains all leaves, 

(ii) for a node v, if v E a, then all descendants of v belong to a too. 

In other words, a seed letter can be thought of as a "horizontal cut" of the tree. Clearly, each 
letter induces a partition on the set of leaves (amino acids). For example, for the tree on 
Figure H a letter defined by the cut through nodes C, FYW, MLIV, G, P, ATS, NHQEDRK 
corresponds to the partition {{C}, {FYW}, {MLIV}, {g}, {p}, {ATS}, {NHQEDRK}}. 

Seed letters are naturally ordered by inclusion. The smallest one is the "identity" seed 
letter containing only the leaves. The largest one is the "joker" seed letter _, containing 
all the nodes of T. One particular seed letter is obtained by removing from _ the root node. 
We denote it by @. 

Observe that each seed letter a represents naturally an equivalence relation on S: a* and 
a,j are related iff their common ancestor belongs to a. It is identity relation in case of # and 
full relation in case of _ . 

Following condition ©, a hierarchical seed alphabet is a family B of seed letters such 
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that 

for every ai,a 2 G B, either a\ C a 2 or a; 2 C a x . (3) 

Hence, in mathematical terms, a seed alphabet is a chain in the inclusion ordering of seed 
letters. Each hierarchical alphabet can be obtained by a series of refinements (set splittings) 
of its least refined letter. 

Let us analyze what are the maximal seed alphabets wrt. inclusion. Clearly each maximal 
seed alphabet B always contains the smallest and the largest seed letters # and _ . Interest- 
ingly, each maximal alphabet B contains also @, as @ is comparable (by inclusion) to any 
other seed letter. 

It can be shown that under the above definitions, any maximal seed alphabet contains 
exactly 20 letters that can be obtained by a stepwise merging of two subtrees rooted at 
immediate descendants of some node v into the subtree rooted at v. Therefore, since a binary 
tree with n leaves contains n — 1 internal nodes, a maximal seed alphabet contains precisely 
20 letters and can be specified by a permutation of internal nodes in tree T. 

Seed alphabets and constraint independence systems. It is interesting to observe that the 
set of seed alphabets forms a constrained independence system [|25l . An independence system 
is a collection of subsets X C 2 E over a ground set E, called independent sets, such that (i) 
G X, and (ii) if X G X and Y C X, then Y G X. A maximal (w.r.t. inclusion) independent 
set is called a base. 

Let E be the set of all possible seed letters as defined earlier. Then alphabets verifying (0) 
form an independence system, where bases correspond to maximal seed alphabets. Moreover, 
seed alphabets verify two additional conditions of constrained independent system ll25l : ( Hi) 
if X, Y G X with \Y\ < \X\, then there is an element e G E \ Y such that Y U {e} G X, and 
(iv) the cardinality of every minimal (w.r.t inclusion) set of 2 E \X is two. 

The interest of this observation follows from results of G51 showing that some optimization 
problems on constrained independence systems can be solved efficiently by greedy algorithms. 
Assume we have a score function s : E — > R that we extend additively to independent sets 
by s(X) = 2~2eex s ( e )- For an independence system X, we want to find a base X G X 
with optimal (maximal or minimal) s(X). For constrained independence systems, it was 
proved ll2~5ll that the greedy algorithm yields a base which is locally optimal, i.e. better than 
any neighbor base Y — (X \ {oti}) U {a 2 } for some a\ G X, a 2 G E \ X. Here, the greedy 
algorithm starts with the empty set and iteratively adds most optimal elements of E as long 
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as the current set remains independent. The absolute optimum is hard to compute in general, 
and the greedy solution is an approximation of it. 

Assigning letter score. The above setting requires that each letter a is assigned a score 
that, intuitively, should measure the "usefulness" of a in a potential alphabet. Defining such a 
measure is a difficult question as there are too many potential alphabets and we can not check 
them all exhaustively. Therefore, we chose to consider only small alphabets B a , containing 
a together with a few other letters that are always present in a good seed alphabet. Those 
letters are {_, @, The experiments reported in Section |V] use the alphabet B a = {_,«}. 

Given B a , we define the score of a as follows. We enumerate all seeds of a given span 
(typically, 5 or 6) over B a , and compute the sensitivity and selectivity of each seed according 



{CFYWMLIVGPATSNHQEDRK} 
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{CFYW} {MLIV} {GPATS} {NHQEDRK} 
{CFYW} {MLIV} {G} {PATS} {NHQEDRK} 
{C} {FYW} {MLIV} {G} {PATS} {NHQEDRK} 
{C} {FYW} {MLIV} {G} {P} {ATS} {NHQEDRK} 
{C} {FY} {W} {MLIV} {G} {P} {ATS} {NHQEDRK} 
{C} {F} {Y} {W} {MLIV} {G} {P} {ATS} {NHQEDRK} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {TS} {NHQEDRK} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {NHQEDRK} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {NHQED} {RK} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {NHQED} {R} {K} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {NH} {QED} {R} {K} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {N} {H} {QED} {R} {K} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {N} {H} {QE} {D} {R} {K} 
{C} {F} {Y} {W} {MLIV} {G} {P} {A} {T} {S} {N} {H} {Q} {E} {D} {R} {K} 
{C} {F} {Y} {W} {ML} {IV} {G} {P} {A} {T} {S} {N} {H} {Q} {E} {D} {R} {K} 
{C} {F} {Y} {W} {M} {L} {IV} {G} {P} {A} {T} {S} {N} {H} {Q} {E} {D} {R} {K} 
{C} {F} {Y} {W} {M} {L} {1} {V} {G} {P} {A} {T} {S} {N} {H} {Q} {E} {D} {R} {K} 



Fig. 3 

Alphabet transitive-predefined designed using the tree of Figure[2] Each line corresponds to a 

seed letter (amino acid partition) 
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to the protocol described in Section IV-Bl Each seed is then associated with a point on a unit 
square with coordinates corresponding to sensitivity and selectivity (see plots in Figure [6] 
below). The distance of this point to point (1, 1), denoted p(a), measures how good the 
sensitivity and selectivity jointly are. Besides, the number of occurrences of a in the seed 
should be taken into account. Overall, we chose to compute the score of a letter by the 
following formula: 



where the sum is taken over all seeds n of a given span, and occ n (a) is the number of 
occurrences of a in it. 

Greedy algorithm. Once every seed letter has been assigned a score, we compute the greedy 
solution as follows. We compute the maximal subset L of locally good letters, i.e. letters a 
that score better than any letter a' such that {a, a'} ^ X. It can be shown that this subset 
is independent and is included in the solution computed by the greedy algorithm. Then we 
redefine E and 1 by E' = E \ L and T = {Z C E' \ Z U L e 1} and apply the algorithm 
recursively to the independence system (E',T). The union of all sets L of locally good 
letters computed along this procedure forms the solution of the greedy algorithm. 

Resulting alphabet. Figure [3] shows alphabet Transitive-predefined designed through 
the approach of this Section. The alphabet has been designed from the tree of Figure [2] and 
using the alphabet B a = {-, a} for assigning the score of a letter a. Each line in Figure [2] 
corresponds to a letter (amino acid partition). Among alphabets obtained by varying different 
parameters in scoring individual letters (such as the alphabet and seed spans used in the 
scoring procedure), alphabet Transit ive-predef ined produced best seeds and will be 
used in the experimental part of this work (Section |V]). 

B. Transitive alphabets using an ab initio clustering method 

Hierarchical clustering of amino acids. A prerequisite to the approach of Section IIV-AI 
is a given tree describing a hierarchical clustering of amino acid based on some similarity 
measure. In this section, we describe an approach that constructs ab initio a hierarchical clus- 
tering of amino acids, using a likelihood measure. The approach can be seen as constructing 
a hierarchy of connected components of a graph based on the likelihood relation considered 
in Section [Till (see Figure CO) trying to build components with high likelihood values. 

As in Section IIV-AI our goal here is to construct a family of seed letters verifying the 
hierarchy condition ©. This family will be obtained with a simple greedy neighbor-joining 
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clustering algorithm. We start with the partition of amino acids into 20 singletons. This 
partition corresponds to the # letter. For a current partition P = {C\, . . . ,C n }, iteratively 
apply the following procedure. 

1 For each pair of sets C k , Ci, 

1.1 consider the set Bridge(C k ,Ce) = {(a,i,aj)\ai G C k , aj G Cg}. 

1.2 compute ForeBridgeProb(k, £) = Yl{fij\ a i e ^ki a-j G Ce} and 
BackBridgeProb(k, £) = J2{^j\ a i ^ @k, a j G Ci}, 

1 .3 compute L(k, £) = ForeBridgeProb(k, £) / BackBridgeProb(k, £) 

2 Find the pair of sets (Ck,Ci) yielding the maximal L(k,£), 

3 Merge C k and Cg into a new set, obtaining a new partition. 

The rationale behind this simple procedure is that those two sets of amino acids are merged 
together which produce the maximal increment in the likelihood. An alternative method, when 
the likelihood of the whole resulting set is maximized, yields biased results, as sets with a 
high likelihood tend to "absorb" other sets. 

Resulting alphabet. An alphabet, called Transit ive-ab-initio, obtained with this 
greedy neighbor-joining approach is given in Figured! It will be used in experiments presented 
later in Section |V] 

C. Non-hierarchical alphabets 

Previous approaches (Sections 4.1 and 4.2) were based on requirement © specifying that 
letters of the seed alphabet should be embedded one into another to form a "nested" hierarchy. 
This requirement is biologically motivated and, on the other hand, computationally useful 
as it reduces considerably the space of possible letters. However, this requirement is not 
necessary to implement the direct indexing (see Introduction). Therefore, we also designed 
non-hierarchical alphabets in order to compare them to hierarchical ones. 

To design non-hierarchical alphabets, we used a heuristic generalizing the one of Sec- 
tion llV-Bl The heuristic consists of two stages: first, generate a big number (several thousands) 
of "reasonable" candidate letters, and then select from them an alphabet containing ~20 
transitive letters (not necessarily forming a hierarchy). 

The algorithm of the first stage exploits the standard paradigm of genetic algorithms: it 
consequently creates "generations" of transitive letters. The initial population consists of a 
single "identity" seed letter. At the fc-th iteration (k = 1, . . . , 19), each letter generates p 
descendants, each having (20 — k) sets. 
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{CFYWHMLIVPGQERKNDATS} 
{CFYWHMLIV} {PGQERKNDATS} 
{C} {FYWHMLIV} {PGQERKNDATS} 
{C} {FYWHMLIV} {P} {GQERKNDATS} 
{C} {FYWH} {MLIV} {P} {GQERKNDATS} 
{C} {FYWH} {MLIV} {P} {GATS} {QERKND} 
{C} {FYWH} {MLIV} {P} {G} {ATS} {QERKND} 
{C} {FYWH} {MLIV} {P} {G} {ATS} {QERK} {ND} 
{C} {FYW} {H} {MLIV} {P} {G} {ATS} {QERK} {ND} 
{C} {FYW} {H} {MLIV} {P} {G} {A} {TS} {QERK} {ND} 
{C} {FYW} {H} {MLIV} {P} {G} {A} {TS} {QE} {RK} {ND} 
{C} {FYW} {H} {ML} {IV} {P} {G} {A} {TS} {QE} {RK} {ND} 
{C} {FYW} {H} {ML} {IV} {P} {G} {A} {TS} {QE} {RK} {N} {D} 
{C} {FYW} {H} {ML} {IV} {P} {G} {A} {T} {S} {QE} {RK} {N} {D} 
{C} {FY} {W} {H} {ML} {IV} {P} {G} {A} {T} {S} {QE} {RK} {N} {D} 
{C} {FY} {W} {H} {ML} {IV} {P} {G} {A} {T} {S} {Q} {E} {RK} {N} {D} 
{C} {FY} {W} {H} {M} {L} {IV} {P} {G} {A} {T} {S} {Q} {E} {RK} {N} {D} 
{C} {FY} {W} {H} {M} {L} {1} {V} {P} {G} {A} {T} {S} {Q} {E} {RK} {N} {D} 
{C} {F} {Y} {W} {H} {M} {L} {1} {V} {P} {G} {A} {T} {S} {Q} {E} {RK} {N} {D} 
{C} {F} {Y} {W} {H} {M} {L} {1} {V} {P} {G} {A} {T} {S} {Q} {E} {R} {K} {N} {D} 



Fig. 4 

Alphabet trans it ive- ab-initio obtained with the method of Section HV-BI 



To generate descendants of a letter from the /c-th generation, we use the algorithm given in 
Section IIV-BI but maintain p (instead of just one) best partitions according to the likelihood 
of the "bridge". The [k + l)-th generation is selected among all descendants of the k-th 
generation by selecting those q letters a which have the highest likelihood ratio f(a)/b(a). 
With p = 100 and q = 500 the procedure gives about 8000 candidate letters. 

To select a small number of those letters to form an alphabet, we tried different heuristics 
based on the following two ideas: (1) letters with high likelihood ratio are preferred (2) 
alphabet letters should have a range of different weights. The second option produced a 
better alphabet. 

Resulting alphabet. We selected twenty letters out of about 8000 candidates by partitioning 
the candidates into twenty groups according to their weight ranging from to 1 with increment 
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0.05, and by picking in each group the letter with maximal likelihood. An alphabet obtained 
with the above heuristic, called Non-tree-transitive, is shown in Figure This 
alphabet will be used in the experiments reported in Section IVl 

{ARNDCQEGHILMK FPSTWYV} 
{ARNDQEGHILMKFPSTWYV} {C} 
{ARNDCQEHILMKFPSTWYV} {G} 
{ARNDQEHILMKFSTYV} {CGPW} 
{ARCQEHILMKFSTYV} {NDGPW} 
{ARNDCQEGHKPST} {ILMFWYV} 
{ARNDQEGHKST} {CILMFWYV} {P} 
{ARNDQEHKPST} {CW} {G} {ILMFYV} 
{ARNDQEKST} {CP} {GHW} {ILMFYV} 
{AGPST} {RNDQEHK} {C} {ILMFWYV} 
{APST} {RNDQEHK} {CW} {G} {ILMFYV} 
{AGST} {RNDQEK} {C} {HFWY} {ILMV} {P} 
{AST} {RNDQEK} {CH} {G} {ILMV} {FWY} {P} 
{AST} {RQEHK} {ND} {CP} {G} {ILMV} {FWY} 
{AST} {RQK} {NH} {DE} {C} {G} {ILMV} {FWY} {P} 
{A} {RQK} {N} {DE} {C} {G} {H} {ILMV} {FY} {P} {ST} {W} 
{A} {RK} {N} {DE} {C} {QH} {G} {ILV} {M} {FY} {P} {ST} {W} 
{A} {RQK} {ND} {C} {E} {G} {H} {IV} {LM} {FWY} {P} {ST} 
{A} {RK} {ND} {C} {Q} {E} {G} {H} {IV} {LM} {FWY} {P} {S} {T} 
{A} {RK} {N} {D} {C} {Q} {E} {G} {H} {IV} {L} {M} {FY} {P} {S} {T} {W} 
{A} {R} {N} {D} {C} {QE} {G} {H} {1} {L} {K} {M} {FWY} {P} {S} {T} {V} 
{A} {R} {N} {D} {C} {Q} {E} {G} {H} {1} {L} {K} {M} {F} {P} {S} {T} {W} {V} 



Fig. 5 

Non-hierarchical alphabet non-tree-irans hive designed with the algorithm of Section HV-CI 



V. Experiments 

This section describes the experiments we made to test the efficiency of seeds we de- 
signed with different methods of previous sections. Sections IV-AI - IV-CI describe the exper- 
imental protocol, from the assignment of background and foreground probabilities, to the 
seed design. In Section IV-D[ we analyze the power of different seed models proposed in 
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Sections UnHTVl with respect to probabilistic models. Then in Section IV-E[ we benchmark 
the performance of seeds built over different alphabets from Section [IV] against Blastp, 
on several reference protein databases. For Sections IV-DI and IV-El all relative experimental 
data including scripts, designed alphabets, seeds and seed families, and resulting sensitivity 
and selectivity measures, have been collected in a supplementary Web page available at 

|http : / /bioinf o . lif 1 . f r/yass/iedera_proteins/[ 

A. Probability assignment and alphabet generation 

First of all, we derived probabilistic models in accordance with the BLOSUM62 data from 
the original paper ll2~6l . We obtained the BLOCKS database (version 5) [j223 and the software 
of ll26ll to infer Bernoulli probabilities for the background and foreground alignment models. 
These probabilities have been used throughout the whole pipeline of experiments. 

Different seed alphabets have then been generated by the methods presented in Section HIT1 
(alphabet Non-transitive), Section HV-AI (alphabet Transitive-predefined), Sec- 
tion IIV-BI (alphabet Transit ive-ab-initio) and Section IIV-CI (alphabet 
Non-tree-transitive). 

B. Seed design 

To each alphabet, we applied a seed design procedure that we briefly describe now. Since 
each seed (or seed family) is characterized by two parameters - sensitivity and selectivity 
- it can be associated with a point on a 2-dimensional plot. Best seeds are then defined to 
be those which belong to the Pareto set among all seeds, i.e. those than cannot be strictly 
improved by increasing sensitivity, selectivity, or both. 

For different selectivity levels, we designed good seed families containing one to six 
individual seeds, among which the best family was selected. In each seed family, individual 
seeds have been chosen to have approximately the same weight, within 5% tolerance. This 
requirement is natural as in the case of divergent weights, seeds with lower weight would 
dominantly affect the performance. In practice, having individual seeds of similar weight 
allows an efficient parallel implementation (see e.g. ifTTl ). 

Estimation of sensitivity of individual seeds or seed families has been done with the algo- 
rithm described in HI and implemented in the Iedera software, available at http : / /bioinf o . lif l . f r/yass/i 
The selectivity of an individual seed has been computed according to the definition (Sec- 



January 21, 2009 



DRAFT 



1 / 

tion|nl). For a seed family, its selectivity has been lower-estimated by summing the background 
probabilities of individual seeds. 

Seed family design has been done using a hill climbing heuristic (see [28J, Il29l0 alternating 
seed generation and seed estimation steps. All experiments were conducted for alignment 
lengths 16 and 32. 

C. BLASTP and the vector seed family from /]?]/ 

Our goal is to compare between different seed design approaches proposed in this paper, 
but also to benchmark them against other reference seeding methods. We used two references: 
the Blastp seeding method and the family of vector seeds proposed in @J. Both of them 
use a score (or weight) resulting from the cumulative contribution of several neighboring 
positions to define a hit (see Introduction). Therefore, they use a more powerful (and also 
more costly to implement) formalism of seeding. 

To estimate the sensitivity and selectivity of those seeds, we modified our methods de- 
scribed in the previous section by representing an alignment by a sequence of possible 
individual scores. Foreground and background probability of each score is easily computed 
from those for amino acid pairs. After that, sensitivity and selectivity is computed similarly 
to the previous case. 

D. Results on theoretical models 

We compare the performance of the different approaches by plotting ROC curves of Pareto- 
optimal sets of seeds on the selectivity/sensitivity graph. The two plots in Figure [6] show 
the results for alignment length 16 and 32 respectively. Red and green polylines show the 
performance of Blastp with word size 3 and the vector seed family from Q, for different 
score thresholds. The other curves show the performances of different seed alphabets from 
Sections HIIHrVl represented by the Pareto-optimal seeds (seed families) that we were able to 
construct over those alphabets. As mentioned earlier in Section |V-B[ each time we selected 
the best seed family among those with different number of individual seeds. Typically (but not 
exclusively), points on the plots correspond to seed families with 4 to 6 seeds. Typically, the 
seed span ranges between 3 and 5 (respectively, 3 and 6) for alignment length 16 (respectively, 
32). Seeds with larger span (> 4) tend to occur in seed families with larger number of seeds 
(> 3). 
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We observe that seeds over the alphabet of Section [HI] (dark blue curve) are comparable 
in performance with the vector seed family from [4] and clearly outperform seeds over other 
alphabets. This result is interesting in itself, although in many cases this alphabet is not 
practical due to its incompatibility with the transitivity condition. 

As for the other alphabets, they roughly show a comparable performance among them. For 
the alignment length 16, our seeds perform comparably to Blastp, with a slightly better 
performance for high thresholds and a slightly worse performance for low thresholds. On the 
other hand, for alignments of length 32, our seeds clearly outperform Blastp. Note that the 
non-hierarchical alphabet from Section IIV-CI does not bring much of improvement, which 
might indicate that lifting condition © does not bring much of additional power. This point, 
however, requires further investigation. 

E. Results on real data 

We made large-scale tests of our seeds on real data by applying them to several main 
databases of protein alignments. Those databases are BaliBASE (version 3) [30J, HOM- 
STRAD dMl, IRMBase (version 1) 01, OXBench (version 1.3) 63, PFAM (release 
22) [34J, PREFAB (version 4) [35 J, and SMART (version 4) OBI 

First, since all above databases except for OXBench contain multiple alignments, we 
extracted from each of them a dataset of pairwise alignments. For this, pairs of aligned se- 
quences have been randomly extracted from multiple alignments and matching gaps removed. 
To avoid a bias induced by big (in terms of the number of sequences) multiple alignments, 
we selected a smaller fraction of pairwise alignments from big multiple alignments than from 
small ones: the number of selected alignments varied from order of n 2 for small alignments 
to y/n for big ones. The total number of alignment processed in our experiments varied from 
640 (IRMBase) to more than 250000 (PFAM). 

For all those datasets, we identified alignments detected by the Blastp seed for different 
score thresholds (word length 3, BLOSUM62 matrix, score threshold 10 to 13). On the other 
hand, for each BLASTP score threshold, we identified the closest seed family in the Pareto 
set (cf Section IV-BI) with equivalent or greater selectivity. This has been done for each of 
the three transitive alphabets proposed in Section [IV] Selected seeds can be found at the 

Supplementary material Web page |http : //bioinf o . lif 1 . f r/yass/iedera-proteins/| 

Results are shown on Figure |7] Both methods detect a very high fraction of alignments 
of IRMBase (all of them for thresholds 10 and 11). The poorest sensitivity is observed 
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on SMART where alignments represent small sequences of proteins domains of the same 
family. A relatively weak sensitivity on PREFAB is due to its method of obtaining alignments 
which is based on structural information and, at the first step, "does not incorporate sequence 
similarity". Finally, HOMSTRAD combines both structural information (using FUGUE 071 ) 
and sequence information (using PSI-BLAST 0) which explains a better performance of 
seed-based search in this case. 

Comparing the performance of subset seeds vs. Blastp, the former show clearly a better 
performance on BaliBASE, PREFAB and PFAM. On OXBench, HOMSTRAD and 
SMART, the obtained sensitivity is very close to that of Blastp. Globally, subset seeds 
show a better performance for higher selectivity levels (greater thresholds). 

VI. Conclusion 

In this paper, we studied the design of subset seeds for protein alignments, which is a 
very attractive seeding principle that does not use scores at the hitting stage of the align- 
ment procedure. The design of efficient subset seeds subsumes a design of appropriate seed 
alphabets, i.e. sets of seed letters that seeds can be built from. In this paper, we studied 
several approaches to designing alphabets. In Section [Till we considered the most general 
case when seed letters are only required to induce a symmetric binary relation on amino acids. 
In Section [IVl we focused on transitive seed alphabets, where seed letters are required to 
induce an equivalence relation. In Section IIV-AI we proposed an alphabet construction based 
on pre-defined hierarchical clusterings of amino acids, while in Section ITV-Bl we considered 
a construction based on an ad hoc clustering of amino acids based on the likelihood ratio 
measure. Finally, in Section IIV-CI we lifted the requirement of hierarchical clustering and 
considered alphabets with possibly "incompatible" letters (in the sense of embedding of 
equivalence classes). 

The main conclusion of our work is that although the subset seed model is less expressive 
than the method of cumulative score used in Blastp, carefully designed subset seeds can 
reach the same or even a higher performance. To put it informally, the use of the cumulative 
score in defining a hit can, without loss of performance, be replaced by a careful distinction 
between different amino acid matches without using any scoring system. From a practical 
point of view, subset seeds can provide a more efficient implementation, especially for large- 
scale protein comparisons, due to a much smaller number of accesses to the hash table. In 
particular, this can be very useful for parallel implementations or specialized hardware (see 
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e.g. ma, da). 

Interestingly, the Blast team reported recently in Il38ll that they used a reduced amino acid 
alphabet in order to allow for longer seeds while still keeping the hash table of acceptable size. 
(Note also that this idea has recently been independently applied in 11391 . in a slightly different 
context.) This is done, however, by translating one of the sequences into a compressed 
alphabet and still using neighborhoods and a cumulative hit criterion. In this work, we 
demonstrated that instead of that, one can apply carefully designed subset seeds to avoid using 
neighborhoods and scoring systems at the seeding stage, without sacrificing the performance. 

Note that the seed design heuristic sketched in Section IV-BI does not guarantee to compute 
optimal seeds, and therefore our seeds could potentially be further improved by a more 
advanced design procedure, possibly bringing a further increase in performance. This is 
especially true for seeds of large weight (due to a bigger number of those), for which our 
seed design procedure could produce non-optimal seeds, thus explaining some "drop-offs" 
in high-selectivity parts of plots of Figure [6l 

As far as further research is concerned, the question of efficient seed design remains an 
open issue. Improvements of the hill climbing heuristic used in this work are likely to be 
possible. 

Finally, it would be very interesting to further study the relationship between optimal seeds 
and seed letters those seeds contain. In particular, it often appeared in our experiments that 
optimal seeds contained "non-optimal" seed letters. Understanding this phenomenon is an 
interesting theoretical question for further study. 
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Fig. 6 

roc curves of seed performance measured on probabilistic models for alignment length 16 (above) 
and 32 (below). blue, cyan dark green and dark bleu curves represent pareto-optimal seed families 
constructed respectively over alphabets transitive — predefined, transitive— ab— initio, 
non — tree — transitive and no n - t r an s i t i ve . each point of these curves corresponds to a seed 
family, typically 3 to 5 seeds (respectively, 3 to 6 seeds) for alignment length 16 (respectively 32). 
Red and green polylines show the performance of Blastp with word size 3 and the vector seed 
family from ]4), for different score thresholds. 
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Fig. 7 

Sensitivity of subset seeds vs Blastp measured on benchmark alignment databases 
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